Explore and Summarize Data: Red Wine Dataset by John Ortiz ========================================================

Introduction

For this project, I will analyze the red wine dataset. I will try to determine what values of each variable make for the quailty of the wine. First I will perform a univariate, bivariate, and multivariate analysis.

Dataset Background: This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

The dataset is related to a red variant of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are munch more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.

# Load the Data
wine <- read.csv('wineQualityReds.csv')

Structure and summary of the Dataframe

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Let’s do some modifications to the dataset. Quality seems like it would best fit as a ordered factor. Let’s us also add a new ordered factored variable called ‘rating’ based on the quailty of the wine.

#Convert Quality to a Factor
wine$quality <- factor(wine$quality, ordered = T)
#Create 'Rating' as an ordered factor
wine$rating <- ifelse(wine$quality < 5, 'bad', ifelse(
  wine$quality < 7, 'average', 'good'))
wine$rating <- ordered(wine$rating,
                       levels = c('bad', 'average', 'good'))

Univariate Plots

Let’s do a plot the distribution of each of the variable. Also, there are extreme outliers for many variables, let’s remove them so a better analysis can be performed.

INSIGHT: Most of the wines present in the dataset are average quality wines.

INSIGHT: Fixed Acidity is positively skewed. The median is around 8ish.

INSIGHT: There seems to be a Bimodal distribution for Volatile acidity, with peaks at around 0.375 and 0.625.

INSIGHT:The distribution of Citric acid looks a bit off. The seem to be missing data in the higher values. Maybe the data collected was incomplete?

INSIGHT: Residual Sugar is positively skewed with high peaks at around 2.2. There are a lot of outliers present at the higher values.

INSIGHT: Chlorides is positively skewed with high peaks at around 0.25.

INSIGHT: Free Sulphur Dioxide has a high peak at around 5. It is also positively skewed with a long tail. Some outliers in the high values are present

INSIGHT: Total Sulphur Dioxide has a high peak at around 25. It is also positively skewed with a long tail. Some outliers in the high values are present

INSIGHT: Density seems to have a normal Distribution.

INSIGHT: pH seems to have a normal Distribution.

INSIGHT: Sulphates has long tailed distribution.

INSIGHT: Alcohol has a skewed distribution.

Analysis of the Univariate Plots

What is the structure of your dataset?

This Dataset has 1599 rows and 14 columns.

Categorical variables are ‘quality’ and ‘rating’. The rest of the variables are numerical variables which detail the chemical and physical properties of the wine.

What is/are the main feature(s) of interest in your dataset?

My main focus in this dataset project is the ‘quality’ of the wine. What is it excalty that drives the quailty of the wine.

What other features in the dataset do you think will help support your

investigation into your feature(s) of interest?

After looking at the distributions of the varible I think the acidity of the wine is the biggest factor of it’s quailty. Since pH is related to acidity, I also think it may have some effect on the quality. Residual sugar might also have an effect on wine quality, but from the little I know of wine, sugars should not have much of an effect on quality but on whether how ‘dry’ it is.

Did you create any new variables from existing variables in the dataset?

I did do some modifications to the dataset. Quality seems like it would best fit as a ordered factor so I converted it. I also added a new ordered factored variable called ‘rating’ based on the quailty of the wine.

Of the features you investigated, were there any unusual distributions?

Residual sugar and Chloride seems to have extreme outliers present. Density and pH looks to be normally distributed with few outliers. Total and free sulfur dioxides, alcohol and sulphates, and Fixed and volatile acidity seem to be long-tailed. What was unusual was Citric acid. It has large number of zero values! It also showed a unique distribution; a mostly rectangular shape.

Bivariate Plots

For this section I think a correlation matrix should be created to start off. This matrix should give me an idea about which varibles may be correlated.

First let’s remove some columns and convert quality back to a number!

#remove some columns and convert quality back to a number
c <- cor(
  wine %>%
    dplyr::select(-X) %>%
    dplyr::select(-rating) %>%
    mutate(
      quality = as.numeric(quality)
    )
)
#create correlation matrix
emphasize.strong.cells(which(abs(c) > .3 & c != 1, arr.ind = TRUE))
pandoc.table(c)
## 
## ---------------------------------------------------------------------------
##           &nbsp;            fixed.acidity   volatile.acidity   citric.acid 
## -------------------------- --------------- ------------------ -------------
##     **fixed.acidity**             1             -0.2561        **0.6717**  
## 
##    **volatile.acidity**        -0.2561             1           **-0.5525** 
## 
##      **citric.acid**         **0.6717**       **-0.5525**           1      
## 
##     **residual.sugar**         0.1148           0.001918         0.1436    
## 
##       **chlorides**            0.09371           0.0613          0.2038    
## 
##  **free.sulfur.dioxide**       -0.1538          -0.0105         -0.06098   
## 
##  **total.sulfur.dioxide**      -0.1132          0.07647          0.03553   
## 
##        **density**            **0.668**         0.02203        **0.3649**  
## 
##           **pH**             **-0.683**          0.2349        **-0.5419** 
## 
##       **sulphates**             0.183            -0.261        **0.3128**  
## 
##        **alcohol**            -0.06167          -0.2023          0.1099    
## 
##        **quality**             0.1241         **-0.3906**        0.2264    
## ---------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## ------------------------------------------------------------------------------
##           &nbsp;            residual.sugar   chlorides    free.sulfur.dioxide 
## -------------------------- ---------------- ------------ ---------------------
##     **fixed.acidity**           0.1148        0.09371           -0.1538       
## 
##    **volatile.acidity**        0.001918        0.0613           -0.0105       
## 
##      **citric.acid**            0.1436         0.2038          -0.06098       
## 
##     **residual.sugar**            1           0.05561            0.187        
## 
##       **chlorides**            0.05561           1             0.005562       
## 
##  **free.sulfur.dioxide**        0.187         0.005562             1          
## 
##  **total.sulfur.dioxide**       0.203          0.0474         **0.6677**      
## 
##        **density**            **0.3553**       0.2006          -0.02195       
## 
##           **pH**               -0.08565        -0.265           0.07038       
## 
##       **sulphates**            0.005527      **0.3713**         0.05166       
## 
##        **alcohol**             0.04208        -0.2211          -0.06941       
## 
##        **quality**             0.01373        -0.1289          -0.05066       
## ------------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## -----------------------------------------------------------------------------
##           &nbsp;            total.sulfur.dioxide     density         pH      
## -------------------------- ---------------------- ------------- -------------
##     **fixed.acidity**             -0.1132           **0.668**    **-0.683**  
## 
##    **volatile.acidity**           0.07647            0.02203       0.2349    
## 
##      **citric.acid**              0.03553          **0.3649**    **-0.5419** 
## 
##     **residual.sugar**             0.203           **0.3553**     -0.08565   
## 
##       **chlorides**                0.0474            0.2006        -0.265    
## 
##  **free.sulfur.dioxide**         **0.6677**         -0.02195       0.07038   
## 
##  **total.sulfur.dioxide**            1               0.07127      -0.06649   
## 
##        **density**                0.07127               1        **-0.3417** 
## 
##           **pH**                  -0.06649         **-0.3417**        1      
## 
##       **sulphates**               0.04295            0.1485        -0.1966   
## 
##        **alcohol**                -0.2057          **-0.4962**     0.2056    
## 
##        **quality**                -0.1851            -0.1749      -0.05773   
## -----------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## -------------------------------------------------------------------
##           &nbsp;            sulphates      alcohol       quality   
## -------------------------- ------------ ------------- -------------
##     **fixed.acidity**         0.183       -0.06167       0.1241    
## 
##    **volatile.acidity**       -0.261       -0.2023     **-0.3906** 
## 
##      **citric.acid**        **0.3128**     0.1099        0.2264    
## 
##     **residual.sugar**       0.005527      0.04208       0.01373   
## 
##       **chlorides**         **0.3713**     -0.2211       -0.1289   
## 
##  **free.sulfur.dioxide**     0.05166      -0.06941      -0.05066   
## 
##  **total.sulfur.dioxide**    0.04295       -0.2057       -0.1851   
## 
##        **density**            0.1485     **-0.4962**     -0.1749   
## 
##           **pH**             -0.1966       0.2056       -0.05773   
## 
##       **sulphates**             1          0.09359       0.2514    
## 
##        **alcohol**           0.09359          1        **0.4762**  
## 
##        **quality**            0.2514     **0.4762**         1      
## -------------------------------------------------------------------

INSIGHT: Quality is strongly correlated to Alcohol and Volatile acidity.

INSIGHT: Density and Fixed Acidity have a very strong correlation.

INSIGHT: Naturally, Alcohol has negative correlation with density.

INSIGHT: Very unsual! Volatile acidity has a positive correlation with pH… I thought pH and acidity had a negative relationship!

It seems like Box plots between these variables are in order!

INSIGHT: Fixed Acidity has almost no effect on the Quality.

INSIGHT: Volatile acid and quality seem to have a negative coorelation.

INSIGHT: Citric acid and quailty seem to have a positive correlation.

INSIGHT: My previous thought that Residual Sugar may have no effect an effect on the wine quality seems to be true!

INSIGHT: There seems to be a weak negative correalation between chlorides and quality.

INSIGHT: Very low concentrations of Free Sulphur Dioxide produces low quality wine. And very high concentrations produces average wine.

INSIGHT: Very low concentrations of total Sulphur Dioxide produces low quality wine. And very high concentrations produces average wine.

INSIGHT: Lower densities seem to favor higher quailty wines.

INSIGHT: Higher Quailty wines seem to contain less pH

INSIGHT: Why are we seeing that Volatile Acid and pH have a positive correlation? Let’s pose this for further investigation.

INSIGHT: Even though we see many outliers in the ‘Average’ quality wine, it seems that better wines have a stronger concentration of Sulphates.

INSIGHT: Higher quality wines seem to have higher Alcohol content in it. Below are some statitics to help clarify.

## 
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8442 -0.4112 -0.1690  0.5166  2.5888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.12503    0.17471  -0.716    0.474    
## alcohol      0.36084    0.01668  21.639   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared:  0.2267, Adjusted R-squared:  0.2263 
## F-statistic: 468.3 on 1 and 1597 DF,  p-value: < 2.2e-16

INSIGHT: The value of R squared seems to suggest that Alcohol contributes to only about 23% of the quality. What about the other 77%? Let continue to dig. Let’s do a correlation test quailty against each variable.

##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12405165          -0.39055778           0.22637251 
## log10.residual.sugar      log10.chlordies  free.sulfur.dioxide 
##           0.02353331          -0.17613996          -0.05065606 
## total.sulfur.dioxide              density                   pH 
##          -0.18510029          -0.17491923          -0.05773139 
##      log10.sulphates              alcohol 
##           0.30864193           0.47616632

INSIGHT: The correlation test shows that the following variables have a high correlation to quality.

  1. Alcohol
  2. Sulphates(log10)
  3. Volatile Acidity
  4. Citric Acid

Analysis of Bivariate Plots

Talk about some of the relationships you observed in this part of the

investigation. How did the feature(s) of interest vary with other features in the dataset? What was the strongest relationship you found?

Higher concentration of Citric Acid seems to produce higher quality. Higher alcohol percentages seems to produce higher quality. Lower percent of Chloride seems to produce higher quality. Lower densities seems to produce higher quality. The more acidic seems to produce higher quality. Lower Volatile Acidity seems to produce higher quality. Residual sugar seems to have almost no effect on quality. Fixed Acidity seems to have almost no effect on quality.

Did you observe any interesting relationships between the other features

(not the main feature(s) of interest)?

For some reason volatile acidity has a positive correlation with pH…

Multivariate Plots

The data shows that alcohol plays a strong part in the quality of the wine, so will make alcohol a constant and apply a few more variables to see if it effects quality.

INSIGHT: The quality of the alcohol does seem be effected by Density.

INSIGHT: It seems like a high the level of Sulphates paired with higher alcohol content produce higher quality.

INSIGHT: It seems like a low the level of Volatile acid paired with higher alcohol content produce higher quality.

INSIGHT: It seems like a low pH level paired with higher alcohol content produce higher quality.

INSIGHT: No correlation between residual sugar and quality keeping alcohol constant.

INSIGHT: Lower Sulphur Dioxide seems to produces better wine

Now we will see the effect of Acids on the Quality.

INSIGHT: Higher Citric Acid and low Volatile Acid seems to produce better Wines

INSIGHT: Nothing here.

INSIGHT: Nothing here.

Linear Modelling

Now after all these analysis, I am going to take the variables which are most strongly correlated with the quality of the wine and generate a linear model with them.

## 
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = tr_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = tr_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity, 
##     data = tr_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity + 
##     citric.acid, data = tr_data)
## m5: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity + 
##     citric.acid + fixed.acidity, data = tr_data)
## m6: lm(formula = as.numeric(quality) ~ alcohol + sulphates + pH, 
##     data = tr_data)
## 
## ====================================================================================================
##                          m1            m2           m3           m4           m5           m6       
## ----------------------------------------------------------------------------------------------------
##   (Intercept)           0.155        -0.273        0.866***     0.973***     0.497        1.494**   
##                        (0.220)       (0.224)      (0.247)      (0.254)      (0.287)      (0.515)    
##   alcohol               0.333***      0.320***     0.286***     0.284***     0.296***     0.339***  
##                        (0.021)       (0.021)      (0.020)      (0.020)      (0.020)      (0.021)    
##   sulphates                           0.855***     0.599***     0.650***     0.667***     0.733***  
##                                      (0.126)      (0.124)      (0.127)      (0.126)      (0.129)    
##   volatile.acidity                                -1.153***    -1.279***    -1.352***               
##                                                   (0.124)      (0.143)      (0.144)                 
##   citric.acid                                                  -0.231       -0.629***               
##                                                                (0.132)      (0.174)                 
##   fixed.acidity                                                              0.058***               
##                                                                             (0.017)                 
##   pH                                                                                     -0.569***  
##                                                                                          (0.149)    
## ----------------------------------------------------------------------------------------------------
##   R-squared             0.209         0.245        0.308        0.310        0.319        0.256     
##   adj. R-squared        0.208         0.243        0.306        0.307        0.315        0.254     
##   sigma                 0.707         0.691        0.662        0.661        0.657        0.686     
##   F                   252.335       155.125      141.769      107.317       89.264      109.700     
##   p                     0.000         0.000        0.000        0.000        0.000        0.000     
##   Log-likelihood    -1027.549     -1004.996     -963.139     -961.610     -955.575     -997.782     
##   Deviance            478.652       456.660      418.487      417.154      411.937      449.841     
##   AIC                2061.098      2017.992     1936.279     1935.219     1925.150     2005.565     
##   BIC                2075.695      2037.456     1960.608     1964.415     1959.211     2029.894     
##   N                   959           959          959          959          959          959         
## ====================================================================================================

Analysis of the Multivariate Plots

Talk about some of the relationships you observed in this part of the

investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest? Were there any interesting or surprising interactions between features?

High Alcohol coupled with high Sulaphate seems to produce higher quality. High Alcohol coupled with Citric Acid seems to show a weak positive correlation for higher quality.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths

and limitations of your model.

Using the data insights gathered, a linear model was created. But the model did suffer from a limitation. Which was that the majority of the data was for average quality wines. Therefore a significant confidence level in the equations was not produced. This is explained by the low R squared value obtained; alcohol only contributes to only 23% of the Quality. A more complete dataset with more ‘good’ and ‘bad’ quality wines is needed.

Final Plots and Summary

1- Alcohol and Sulphates seems to be important in producing higher wine quality.

2- The linear model showed us the variation in the error percentage with different qualities of Wine, so plot of error value against the quality should useful.

Plot 1

INSIGHT: It’s easy to see that Alcohol has a positive relationship with quality. The more the alcohol is present, the higher the quality. But the R Squared value shows that alcohol is only responsible for about 23% in the variance of the quality.

Plot 2

INSIGHT: This plot shows that wines with both high values for Alcohol percentage and Sulphates seems to yield higher quality wine.

Plot 3

INSIGHT: Since most of the data is for ‘Average’ quality wines, it makes sense that the error is super dense around that section. m5, the best fit linear model with the highest R squared value could only explain 32% change in quality. So to summarize, predicting both ‘Good’ and ‘Bad’ quality wines using this model is ill advised.

Reflections

Since most of the data collected was for ‘average’ quality, the training set was not helpful in building a model that can accurately predict the quality of wine in the extreme ranges (‘bad’ and ‘good’ wines). A data set with more infomation on bad and good quality wines is needed.

My univariate analysis, showed that most variables displayed either a Positively skewed or a Normal Distribution. But the distribution of Citric acid looked a bit off, some wines samples had no citric acid present, seems that there is incomplete data present.

My bivariate analysis, plotting variables vs quality, showed that the most influential factors for quality were alcohol, sulphate and acid concentrations. Also a very weird relationship showed up, Volatile acidity had a positive correlation with pH… I thought pH and acidity had a negative relationship!

My multivariate analysis, showed that Alcohol has a positive relationship with quality. The more the alcohol is present, the higher the quality. But the R Squared value shows that alcohol is only responsible for about 23% in the variance of the quality. Density seemed to have no part in wine quality. And that high values for Alcohol percentage and Sulphates seems to yield higher quality wine.

Suggestions for the future: Maybe a more complete dataset should be collected. One with more values in the good and bad ranges. Also, Volatile acidity having a positive correlation with pH is super weird. Maybe an in-depth analysis of this should done.